{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 嵌入模型微调\n",
    "\n",
    "👏🏻 欢迎来到 uniem 的微调教程，在这里您将学习到：\n",
    "\n",
    "1. `FineTuner` 支持的数据和数据集类型\n",
    "2. 如何使用 `FineTuner` 对 M3E 进行微调\n",
    "3. 如何使用 `FineTuner` 对 sentence_transformers 的模型进行微调\n",
    "4. 如何使用 `FineTuner` 对 huggingface 中的预训练模型从头训练 Embedding 模型\n",
    "5. 如何使用 `FineTuner` 对 gpt 系列的模型进行 SGPT 式的训练\n",
    "\n",
    "如果您是在 colab 环境中运行，请使用 GPU 运行时。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 最开始肯定是安装 `uniem` 库了 😉\n",
    "!pip install uniem"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`uniem` 已经安装完成了，让先看一个简单的例子，来感受一下微调的过程。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "from uniem.finetuner import FineTuner\n",
    "\n",
    "dataset = load_dataset('shibing624/nli_zh', 'STS-B', cache_dir='cache')\n",
    "finetuner = FineTuner.from_pretrained('moka-ai/m3e-small', dataset=dataset)\n",
    "finetuned_model = finetuner.run(epochs=3, batch_size=64, lr=3e-5)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "🎉 微调已经完成了，通过 `FineTuner` 我们只需要几行代码就可以完成微调，就像魔法一样！\n",
    "\n",
    "让我们看看这背后发生了什么，为什么可以这么简单？\n",
    "\n",
    "1. `FineTuner` 会自动根据名称识别和加载模型，您只需要声明即可，就像例子中的 `moka-ai/m3e-small`，这会被识别为 M3E 类模型，`FinTuner` 还支持 sentence-transformers, text2vec 等模型\n",
    "2. `FineTuner` 会自动识别数据格式，只要您的数据类型在 `FineTuner` 支持的范围内，`FineTuner` 就会自动识别并加以使用\n",
    "3. `FineTuner` 会自动选择训练方式，`FineTuner` 会根据模型和数据集自动地选择训练方式，即 对比学习 或者 CoSent 等\n",
    "4. `FineTuner` 会自动选择训练环境和超参数，`FineTuner` 会根据您的硬件环境自动选择训练设备，并根据模型、数据等各种信息自动建议最佳的超参数，lr, batch_size 等，当然您也可以自己手动进行调整\n",
    "5. `FineTuner` 会自动保存微调记录和模型，`FineTuner` 会根据您的设置自动使用您环境中的 wandb, tensorboard 等来记录微调过程，同时也会自动保存微调模型\n",
    "\n",
    "总结一下，`FineTuner` 会自动完成微调所需的各种工作，只要您的数据类型在 `FineTuner` 支持的范围内！\n",
    "\n",
    "那么，让我们看看 `FineTuner` 都支持哪些类型的数据吧。"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## FineTuner 支持的数据类型\n",
    "\n",
    "`FineTuner` 中 `dataset` 参数是一个可供迭代 (for 循环) 的数据集，每次迭代会返回一个样本，这个样本应该是以下三种格式之一：\n",
    "\n",
    "1. `PairRecord`，句对样本\n",
    "2. `TripletRecord`，句子三元组样本\n",
    "3. `ScoredPairRecord`，带有分数的句对样本"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import warnings\n",
    "from uniem.data_structures import RecordType, PairRecord, TripletRecord, ScoredPairRecord\n",
    "\n",
    "os.environ['TOKENIZERS_PARALLELISM'] = 'false'\n",
    "warnings.filterwarnings('ignore')\n",
    "print(f'record_types: {[record_type.value for record_type in RecordType]}')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### PairRecord\n",
    "\n",
    "`PairRecord` 就是句对样本，每一个样本都代表一对相似的句子，字段的名称是 `text` 和 `text_pos`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pair_record = PairRecord(text='肾结石如何治疗？', text_pos='如何治愈肾结石')\n",
    "print(f'pair_record: {pair_record}')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### TripletRecord\n",
    "\n",
    "`TripletRecord` 就是句子三元组样本，在 `PairRecord` 的基础上增加了一个不相似句子负例，字段的名称是 `text`、`text_pos` 和 `text_neg`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "triplet_record = TripletRecord(text='肾结石如何治疗？', text_pos='如何治愈肾结石', text_neg='胆结石有哪些治疗方法？')\n",
    "print(f'triplet_record: {triplet_record}')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ScoredPairRecord\n",
    "\n",
    "`ScoredPairRecord` 就是带有分数的句对样本，在 `PairRecord` 的基础上添加了句对的相似分数(程度)。字段的名称是 `sentence1` 和 `sentence2`，以及 `label`。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1.0 代表相似，0.0 代表不相似\n",
    "scored_pair_record1 = ScoredPairRecord(sentence1='肾结石如何治疗？', sentence2='如何治愈肾结石', label=1.0)\n",
    "scored_pair_record2 = ScoredPairRecord(sentence1='肾结石如何治疗？', sentence2='胆结石有哪些治疗方法？', label=0.0)\n",
    "print(f'scored_pair_record: {scored_pair_record1}')\n",
    "print(f'scored_pair_record: {scored_pair_record2}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 2.0 代表相似，1.0 代表部分相似，0.0 代表不相似\n",
    "scored_pair_record1 = ScoredPairRecord(sentence1='肾结石如何治疗？', sentence2='如何治愈肾结石', label=2.0)\n",
    "scored_pair_record2 = ScoredPairRecord(sentence1='肾结石如何治疗？', sentence2='胆结石有哪些治疗方法？', label=1.0)\n",
    "scored_pair_record3 = ScoredPairRecord(sentence1='肾结石如何治疗？', sentence2='失眠如何治疗', label=0)\n",
    "print(f'scored_pair_record: {scored_pair_record1}')\n",
    "print(f'scored_pair_record: {scored_pair_record2}')\n",
    "print(f'scored_pair_record: {scored_pair_record3}')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 小结\n",
    "\n",
    "`FineTuner` 支持的数据类型有三种，分别是 `PairRecord`，`TripletRecord` 和 `ScoredPairRecord`，其中 `TripletRecord` 比 `PairRecord` 多了一个不相似句子负例，而 `ScoredPairRecord` 是在 `PairRecord` 的基础上添加了句对的相似分数。\n",
    "\n",
    "只要您的数据集是这三种类型之一，`FineTuner` 就可以自动识别并使用。现在让我们看看实际的例子"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 示例：微调 M3E，医疗相似问题数据集\n",
    "\n",
    "现在我们假设我们想要 HuggingFace 上托管的 vegaviazhang/Med_QQpairs 医疗数据集上做微调，让我们先把数据集下载好"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "med_dataset_dict = load_dataset('vegaviazhang/Med_QQpairs')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "让我们查看一下 Med_QQpairs 的数据格式是不是在 `FineTuner` 支持的范围内"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(med_dataset_dict['train'][0])\n",
    "print(med_dataset_dict['train'][1])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们发现 Med_QQpairs 数据集正好符合我们 `ScoredPairRecord` 的数据格式，只是字段名称是 `question1` 和 `question2`，我们只需要修改成 `sentence1` 和 `sentence2` 就可以直接进行微调了"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "from uniem.finetuner import FineTuner\n",
    "\n",
    "dataset = load_dataset('vegaviazhang/Med_QQpairs')['train']\n",
    "dataset = dataset.rename_columns({'question1': 'sentence1', 'question2': 'sentence2'})\n",
    "\n",
    "#  Med_QQpairs只有训练集，我们需要手动划分训练集和验证集\n",
    "dataset = dataset.train_test_split(test_size=0.1, seed=42)\n",
    "dataset['validation'] = dataset.pop('test')\n",
    "\n",
    "finetuner = FineTuner.from_pretrained('moka-ai/m3e-small', dataset=dataset)\n",
    "fintuned_model = finetuner.run(epochs=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "训练过程完成后，会自动保存模型到 finetuned-model 目录下"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls finetuned-model/model"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 示例：微调 text2vec，猜谜数据集\n",
    "\n",
    "现在我们要对一个猜谜的数据集进行微调，这个数据集是通过 json line 的形式存储的，让我先看看数据格式吧。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_json('https://raw.githubusercontent.com/wangyuxinwhy/uniem/main/examples/example_data/riddle.jsonl', lines=True)\n",
    "records = df.to_dict('records')\n",
    "print(records[0])\n",
    "print(records[1])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这个数据集中，我们有 `instruction` 和 `output` ，我们可以把这两者看成一个相似句对。这是一个典型的 `PairRecord` 数据集。\n",
    "\n",
    "`PairRecord` 需要 `text` 和 `text_pos` 两个字段，因此我们需要对数据集的字段进行重新命名，以符合 `PairRecord` 的格式。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Fintuner` 会根据模型名称自动识别模型类型，不需要额外处理。这里我们选择微调 text2vec-base-chinese-sentence 。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "from uniem.finetuner import FineTuner\n",
    "\n",
    "# 读取 jsonl 文件\n",
    "df = pd.read_json('https://raw.githubusercontent.com/wangyuxinwhy/uniem/main/examples/example_data/riddle.jsonl', lines=True)\n",
    "df = df.rename(columns={'instruction': 'text', 'output': 'text_pos'})\n",
    "\n",
    "\n",
    "# 指定训练的模型为 m3e-small\n",
    "finetuner = FineTuner.from_pretrained('shibing624/text2vec-base-chinese-sentence', dataset=df.to_dict('records'))\n",
    "fintuned_model = finetuner.run(epochs=3, output_dir='finetuned-model-riddle')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "上面的两个示例分别展示了对 jsonl 本地 `PairRecord` 类型数据集，以及 huggingface 远程 `ScoredPair` 类型数据集的读取和训练过程。`TripletRecord` 类型的数据集的读取和训练过程与 `PairRecord` 类型的数据集的读取和训练过程类似，这里就不再赘述了。\n",
    "\n",
    "也就是说，你只要构造了符合 `uniem` 支持的数据格式的数据集，就可以使用 `FineTuner` 对你的模型进行微调了。\n",
    "\n",
    "`FineTuner` 接受的 dataset 参数，只要是可以迭代的产生有指定格式的字典 `dict` 就行了，所以上述示例分别使用 `datasets.DatasetDict` 和 `list[dict]` 两种数据格式。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 示例：微调 sentences_transformers\n",
    "\n",
    "`FineTuner` 在设计实现的时候也同时兼容了其他框架的模型，而不仅仅是 uniem！比如， `sentece_transformers` 的 [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) 是一个广受欢迎的模型。现在我们将使用前文提到过的 Med_QQpairs 对其进行微调。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "from uniem.finetuner import FineTuner\n",
    "\n",
    "dataset = load_dataset('vegaviazhang/Med_QQpairs')\n",
    "dataset = dataset.rename_columns({'question1': 'sentence1', 'question2': 'sentence2'})\n",
    "finetuner = FineTuner.from_pretrained('sentence-transformers/all-MiniLM-L6-v2', dataset=dataset)\n",
    "fintuned_model = finetuner.run(epochs=3, batch_size=32)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 示例：从头训练\n",
    "\n",
    "我们除了可以在训练好的 Embedding 模型基础上进行微调外，还可以选择从一个预训练模型开始训练，这个预训练模型可以是 BERT，RoBERTa，T5 等。\n",
    "\n",
    "这里，我们将通过 datasets 的 streaming 的方式来使用一个数据规模较大的数据集，并对一个只有两层的 BERT `uer/chinese_roberta_L-2_H-128` 进行微调。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "from transformers import AutoTokenizer\n",
    "\n",
    "from uniem.finetuner import FineTuner\n",
    "from uniem.model import create_uniem_embedder\n",
    "\n",
    "dataset = load_dataset('shibing624/nli-zh-all', streaming=True)\n",
    "dataset = dataset.rename_columns({'text1': 'sentence1', 'text2': 'sentence2'})\n",
    "\n",
    "# 由于是从头训练，我们需要自己初始化 embedder 和 tokenizer。当然，我们也可以选择新的 pooling 策略。\n",
    "embedder = create_uniem_embedder('uer/chinese_roberta_L-2_H-128', pooling_strategy='cls')\n",
    "tokenizer = AutoTokenizer.from_pretrained('uer/chinese_roberta_L-2_H-128')\n",
    "\n",
    "finetuner = FineTuner(embedder, tokenizer=tokenizer, dataset=dataset)\n",
    "fintuned_model = finetuner.run(epochs=3, batch_size=32, lr=1e-3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 示例：SGPT\n",
    "\n",
    "`FineTuner` 在设计实现的时候还提供了更多的灵活性，以 [SGPT](https://github.com/Muennighoff/sgpt) 为例，SGPT 和前面介绍的模型主要有以下三点不同：\n",
    "\n",
    "1. SGPT 使用 GPT 系列模型（transformer decoder）作为 Embedding 模型的基础模型\n",
    "2. Embedding 向量的提取策略不再是 LastMeanPooling ，而是根据 token position 来加权平均\n",
    "3. 使用 bitfit 的微调策略，在微调时只对模型的 bias 进行更新\n",
    "\n",
    "现在我们将效仿 SGPT 的训练策略，使用 Med_QQpairs 对 GPT2 进行微调。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "from transformers import AutoTokenizer\n",
    "\n",
    "from uniem.finetuner import FineTuner\n",
    "from uniem.training_strategy import BitFitTrainging\n",
    "from uniem.model import PoolingStrategy, create_uniem_embedder\n",
    "\n",
    "dataset = load_dataset('vegaviazhang/Med_QQpairs')\n",
    "dataset = dataset.rename_columns({'question1': 'sentence1', 'question2': 'sentence2'})\n",
    "embedder = create_uniem_embedder('gpt2', pooling_strategy=PoolingStrategy.last_weighted)\n",
    "tokenizer = AutoTokenizer.from_pretrained('gpt2')\n",
    "finetuner = FineTuner(embedder, tokenizer, dataset=dataset)\n",
    "finetuner.tokenizer.pad_token = finetuner.tokenizer.eos_token\n",
    "finetuner.run(epochs=3, lr=1e-3, batch_size=32, training_strategy=BitFitTrainging())"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "uniem",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.11"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
